{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## doc2vec\n",
    "\n",
    "This is an experimental code developed by Tomas Mikolov found in the word2vec google group: https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ\n",
    "\n",
    "This is not yet available on Pypi you need the latest master branch from git.\n",
    "\n",
    "The input format for `doc2vec` is still one big text document but every line should be one document prepended with an unique id, for example:\n",
    "\n",
    "```\n",
    "_*0 This is sentence 1\n",
    "_*1 This is sentence 2\n",
    "```\n",
    "\n",
    "### Requirements\n",
    "\n",
    "1. [`nltk`](http://www.nltk.org/)\n",
    "2. Download some data: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n",
    "3. Untar that data: `tar -xvf aclImdb_v1.tar.gz`\n",
    "\n",
    "## Preprocess\n",
    "\n",
    "Merge data into one big document with an id per line and do some basic preprocessing: word tokenizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import nltk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "directories = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_file = open('/Users/drodriguez/Downloads/alldata.txt', 'w')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "id_ = 0\n",
    "for directory in directories:\n",
    "    rootdir = os.path.join('/Users/drodriguez/Downloads/aclImdb', directory)\n",
    "    for subdir, dirs, files in os.walk(rootdir):\n",
    "        for file_ in files:\n",
    "            with open(os.path.join(subdir, file_), \"r\") as f:\n",
    "                doc_id = \"_*%i\" % id_\n",
    "                id_ = id_ + 1\n",
    "\n",
    "                text = f.read()\n",
    "                text = text\n",
    "                tokens = nltk.word_tokenize(text)\n",
    "                doc = \" \".join(tokens).lower()\n",
    "                doc = doc.encode(\"ascii\", \"ignore\")\n",
    "                input_file.write(\"%s %s\\n\" % (doc_id, doc))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_file.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## doc2vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import word2vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "word2vec.doc2vec('/Users/drodriguez/Downloads/alldata.txt', '/Users/drodriguez/Downloads/vectors.bin', cbow=0, size=100, window=10, negative=5, hs=0, sample='1e-4', threads=12, iter_=20, min_count=1, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction\n",
    "\n",
    "Is possible to load the vectors using the same wordvectors class as a regular word2vec binary file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import word2vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = word2vec.load('/Users/drodriguez/Downloads/vectors.bin')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.vectors.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "The documents vector are going to be identified by the id we used in the preprocesing section, for example document 1 is going to have vector:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model['_*1']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can ask for similarity words or documents on document `1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "indexes, metrics = model.cosine('_*1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.generate_response(indexes, metrics).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now its just a case of matching the id to the data created on the preprocessing step"
   ]
  },
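  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of that matching, assuming `alldata.txt` is the file written in the preprocessing step and that any similar label starting with `_*` is a document id (other labels are plain vocabulary words):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Map each document id (e.g. '_*1') back to its preprocessed text\n",
    "docs = {}\n",
    "with open('/Users/drodriguez/Downloads/alldata.txt') as f:\n",
    "    for line in f:\n",
    "        doc_id, _, text = line.partition(' ')\n",
    "        docs[doc_id] = text.strip()\n",
    "\n",
    "# Preview the documents most similar to document 1\n",
    "for label, metric in model.generate_response(indexes, metrics).tolist():\n",
    "    if label.startswith('_*'):\n",
    "        print(label, round(metric, 3), docs[label][:80])"
   ]
  },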
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
